Background

This file is a subset of the code in Coronavirus_Statistics_v002.Rmd and includes the latest code for analyzing data from The COVID Tracking Project. The COVID Tracking Project publishes US coronavirus data on positive tests, hospitalizations, deaths, and related metrics. Downloaded data are unique by state and date.

Companion code for functions is in Coronavirus_Statistics_Functions_CTP_v003.R and Coronavirus_Statistics_Functions_Shared_v003.R. The code leverages tidyverse and a variable mapping file throughout:

# All functions assume that tidyverse and its components are loaded and available
# Other functions are declared in the sourcing files or use library::function()
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# If the same function is in both files, use the version from the more specific source
source("./Coronavirus_Statistics_Functions_Shared_v003.R")
source("./Coronavirus_Statistics_Functions_CTP_v003.R")

# Create a variable mapping file
varMapper <- c("cases"="Cases", 
               "newCases"="Increase in cases, most recent 30 days",
               "casesroll7"="Rolling 7-day mean cases", 
               "deaths"="Deaths", 
               "newDeaths"="Increase in deaths, most recent 30 days",
               "deathsroll7"="Rolling 7-day mean deaths", 
               "cpm"="Cases per million",
               "cpm7"="Cases per day (7-day rolling mean) per million", 
               "newcpm"="Increase in cases, most recent 30 days, per million",
               "dpm"="Deaths per million", 
               "dpm7"="Deaths per day (7-day rolling mean) per million", 
               "newdpm"="Increase in deaths, most recent 30 days, per million", 
               "hpm7"="Currently Hospitalized per million (7-day rolling mean)", 
               "tpm"="Tests per million", 
               "tpm7"="Tests per million per day (7-day rolling mean)"
               )
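
A named character vector such as varMapper works as a lookup table: indexing by variable name returns the display label. The sketch below uses a two-entry copy (labelMap and plotVars are hypothetical names) to show the pattern. How the sourced functions actually consume varMapper is not shown in this file.

```r
# Hypothetical two-entry copy of the mapping, for illustration only
labelMap <- c("cpm7" = "Cases per day (7-day rolling mean) per million",
              "dpm7" = "Deaths per day (7-day rolling mean) per million")

# Single lookup: [[ ]] returns the bare label string
labelMap[["cpm7"]]

# Vectorized lookup: unname() drops the names so only the labels remain
plotVars <- c("dpm7", "cpm7")
plotLabels <- unname(labelMap[plotVars])
plotLabels
```

The same indexing can supply axis titles or facet labels when a plotting function receives only the short variable name.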

Running Code

The main function is readRunCOVIDTrackingProject(), which performs multiple tasks:

STEP 1: Extracts a file of population by state (by default uses 2015 population from usmap::statepop)
STEP 2a^: Downloads the latest data from COVID Tracking Project if requested
STEP 2b^: Reads in data from a specified local file (may have just been downloaded in step 2a), and checks control total trends against a previous version of the file
STEP 3^: Processes the loaded data file to keep the proper variables, drop non-valid states, etc.
STEP 4^: Adds per-capita metrics for cases, deaths, tests, and hospitalizations
STEP 5: Adds existing clusters by state if passed as an argument to useClusters=, otherwise creates new segments based on user-defined parameters
STEP 6^^: Creates assessment plots for the state-level clusters
STEP 7^^: Creates consolidated plots of cases, hospitalizations, deaths, and tests
STEP 8^^: Optionally, creates plots of cumulative burden by segments and by state
STEP 9: Returns a list of key data frames, modeling objects, named cluster vectors, etc.

^ The user can instead specify a previously processed file and skip steps 2a, 2b, 3, and 4. The previously processed file needs to be formatted and filtered such that it can be used “as is”.
^^ The user can skip the segment-level assessments by setting skipAssessmentPlots=TRUE.
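
Step 4 can be pictured with a small base R sketch. The population and case counts below are made up, and the centered 7-day window is an assumption: it is consistent with the three leading NA values per state in the cpm7 output later in this file, but the actual implementation lives in the sourced files and may differ.

```r
# Hypothetical toy data: daily new cases for one state with a made-up population
pop <- 2e6
cases <- c(0, 2, 4, 6, 8, 10, 12, 14, 16, 18)

# Per-million rate: scale the daily count by population
cpm <- cases * 1e6 / pop

# Centered 7-day rolling mean; the first and last 3 days have no full window and are NA
# The stats:: prefix matters because dplyr::filter() masks stats::filter()
cpm7 <- as.numeric(stats::filter(cpm, rep(1 / 7, 7), sides = 2))

data.frame(cases, cpm, cpm7)
```

A trailing window (sides = 1) would instead leave the first six days as NA and no gaps at the end of the series.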

Broadly, there are several use cases for the function:

  1. Download new data, run new segments, and assess the segments (all steps)
  2. Download new data, use existing segments, and assess the segments (steps 1-4 and 6-9)
  3. Use previously processed data, explore and assess various segmenting methodologies (steps 1 and 5-9)
  4. Create a list from existing state burden and cluster data for another purpose (steps 1, 5, and 9)

An example of each use case follows. To avoid unnecessary calls to the COVID Tracking Project server, data are not repeatedly downloaded: the download is skipped when the local file already exists.
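
The examples below pass downloadTo as NULL whenever the local file is already present. A minimal sketch of that idiom, with a temporary file standing in for the downloaded CSV (the variables downloadTo1 and downloadTo2 are hypothetical, and the interpretation of NULL as "skip the download" is inferred from the step descriptions above):

```r
# Hypothetical local path; a temp file stands in for the downloaded CSV
locDownload <- tempfile(fileext = ".csv")

# First pass: the file does not exist yet, so a download target is supplied
downloadTo1 <- if (file.exists(locDownload)) NULL else locDownload

# Simulate the download by writing a small placeholder file
writeLines("date,state", locDownload)

# Second pass: the file now exists, so NULL is passed instead of a path
downloadTo2 <- if (file.exists(locDownload)) NULL else locDownload
```

Deleting the local file, or pointing locDownload at a new dated filename, is then enough to trigger a fresh download on the next run.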

Further, files can be saved in RDS format so they can be loaded and used later.

Use Case 1: Download new data, create segments, assess performance

The full process downloads data, creates segments, and assesses performance. Hierarchical segmentation with a heavy focus on deaths vs. cases tends to work well for creating state-level clusters:

# Create segments and download data from COVID Tracking Project
# Create 6 segments but place Vermont (a very small state and dendrogram outlier) in the New Hampshire segment
locDownload <- "./RInputFiles/Coronavirus/CV_downloaded_201025.csv"
test_hier5_201025 <- readRunCOVIDTrackingProject(thruLabel="Oct 24, 2020", 
                                                 downloadTo=if(file.exists(locDownload)) NULL else locDownload,
                                                 readFrom=locDownload, 
                                                 compareFile=readFromRDS("test_hier5_201001")$dfRaw,
                                                 hierarchical=TRUE, 
                                                 reAssignState=list("VT"="NH"), 
                                                 kCut=6, 
                                                 minShape=3, 
                                                 ratioDeathvsCase = 5, 
                                                 ratioTotalvsShape = 0.5, 
                                                 minDeath=100, 
                                                 minCase=10000
                                                 )
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   state = col_character(),
##   totalTestResultsSource = col_character(),
##   dataQualityGrade = col_character(),
##   lastUpdateEt = col_character(),
##   dateModified = col_datetime(format = ""),
##   checkTimeEt = col_character(),
##   dateChecked = col_datetime(format = ""),
##   fips = col_character(),
##   hash = col_character(),
##   grade = col_logical()
## )
## i Use `spec()` for the full column specifications.
## 
## File is unique by state and date
## 
## 
## Overall control totals in file:
## # A tibble: 1 x 3
##   positiveIncrease deathIncrease hospitalizedCurrently
##              <dbl>         <dbl>                 <dbl>
## 1          8531788        216646               8686442
## 
## *** COMPARISONS TO REFERENCE FILE: compareFile
## 
## Checkin for similarity of: column names
## In reference but not in current: 
## In current but not in reference: probableCases
## 
## Checkin for similarity of: states
## In reference but not in current: 
## In current but not in reference: 
## 
## Checkin for similarity of: dates
## In reference but not in current: 
## In current but not in reference: 2020-10-24 2020-10-23 2020-10-22 2020-10-21 2020-10-20 2020-10-19 2020-10-18 2020-10-17 2020-10-16 2020-10-15 2020-10-14 2020-10-13 2020-10-12 2020-10-11 2020-10-10 2020-10-09 2020-10-08 2020-10-07 2020-10-06 2020-10-05 2020-10-04 2020-10-03 2020-10-02 2020-10-01
## 
## *** Difference of at least 5 and difference is at least 1%:
## Joining, by = c("date", "name")
##         date             name newValue oldValue
## 1 2020-03-28 positiveIncrease    19925    19692
## 2 2020-03-28    deathIncrease      544      538
## 3 2020-03-29 positiveIncrease    19348    19581
## 4 2020-03-29    deathIncrease      515      521
## Joining, by = c("date", "name")
## Warning: Removed 24 row(s) containing missing values (geom_path).
## 
## 
## *** Difference of at least 5 and difference is at least 1%:
## Joining, by = c("state", "name")
##   state             name newValue oldValue
## 1    HI positiveIncrease    12469    12289
## Rows: 13,157
## Columns: 55
## $ date                        <date> 2020-10-24, 2020-10-24, 2020-10-24, 20...
## $ state                       <chr> "AK", "AL", "AR", "AS", "AZ", "CA", "CO...
## $ positive                    <dbl> 13535, 183276, 105318, 0, 236772, 89281...
## $ probableCases               <dbl> NA, 26330, 7105, NA, 5417, NA, 6492, 26...
## $ negative                    <dbl> 539585, 1138922, 1181805, 1616, 1462194...
## $ pending                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ totalTestResultsSource      <chr> "totalTestsViral", "totalTestsViral", "...
## $ totalTestResults            <dbl> 552746, 1295868, 1280018, 1616, 1693549...
## $ hospitalizedCurrently       <dbl> 58, 920, 606, NA, 819, 3007, 550, 233, ...
## $ hospitalizedCumulative      <dbl> NA, 19595, 6707, NA, 21043, NA, 8557, 1...
## $ inIcuCurrently              <dbl> NA, NA, 242, NA, 191, 744, NA, NA, 21, ...
## $ inIcuCumulative             <dbl> NA, 2021, NA, NA, NA, NA, NA, NA, NA, N...
## $ onVentilatorCurrently       <dbl> 8, NA, 94, NA, 87, NA, NA, NA, 6, NA, N...
## $ onVentilatorCumulative      <dbl> NA, 1157, 808, NA, NA, NA, NA, NA, NA, ...
## $ recovered                   <dbl> 6939, 74439, 93977, NA, 39525, NA, 7463...
## $ dataQualityGrade            <chr> "A", "A", "A+", "D", "A+", "B", "A", "B...
## $ lastUpdateEt                <chr> "10/24/2020 03:59", "10/24/2020 11:00",...
## $ dateModified                <dttm> 2020-10-24 03:59:00, 2020-10-24 11:00:...
## $ checkTimeEt                 <chr> "10/23 23:59", "10/24 07:00", "10/23 20...
## $ death                       <dbl> 68, 2866, 1797, 0, 5869, 17311, 2076, 4...
## $ hospitalized                <dbl> NA, 19595, 6707, NA, 21043, NA, 8557, 1...
## $ dateChecked                 <dttm> 2020-10-24 03:59:00, 2020-10-24 11:00:...
## $ totalTestsViral             <dbl> 552746, 1295868, 1280018, 1616, NA, 176...
## $ positiveTestsViral          <dbl> 11644, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ negativeTestsViral          <dbl> 540786, NA, 1181805, NA, NA, NA, NA, NA...
## $ positiveCasesViral          <dbl> 13535, 156946, 98213, 0, 231355, 892810...
## $ deathConfirmed              <dbl> 68, 2680, 1640, NA, 5581, NA, NA, 3674,...
## $ deathProbable               <dbl> NA, 186, 157, NA, 288, NA, NA, 903, NA,...
## $ totalTestEncountersViral    <dbl> NA, NA, NA, NA, NA, NA, 1790404, NA, 49...
## $ totalTestsPeopleViral       <dbl> NA, NA, NA, NA, 1693549, NA, 1124409, N...
## $ totalTestsAntibody          <dbl> NA, NA, NA, NA, 312232, NA, 179232, NA,...
## $ positiveTestsAntibody       <dbl> NA, NA, NA, NA, NA, NA, 12741, NA, NA, ...
## $ negativeTestsAntibody       <dbl> NA, NA, NA, NA, NA, NA, 166491, NA, NA,...
## $ totalTestsPeopleAntibody    <dbl> NA, 63359, NA, NA, NA, NA, NA, NA, NA, ...
## $ positiveTestsPeopleAntibody <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ negativeTestsPeopleAntibody <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ totalTestsPeopleAntigen     <dbl> NA, NA, 46505, NA, NA, NA, NA, NA, NA, ...
## $ positiveTestsPeopleAntigen  <dbl> NA, NA, 7891, NA, NA, NA, NA, NA, NA, N...
## $ totalTestsAntigen           <dbl> NA, NA, 21856, NA, NA, NA, NA, NA, NA, ...
## $ positiveTestsAntigen        <dbl> NA, NA, 3300, NA, NA, NA, NA, NA, NA, N...
## $ fips                        <chr> "02", "01", "05", "60", "04", "06", "08...
## $ positiveIncrease            <dbl> 374, 2360, 1183, 0, 890, 5945, 1350, 0,...
## $ negativeIncrease            <dbl> 0, 5064, 11643, 0, 11213, 119941, 9351,...
## $ total                       <dbl> 553120, 1322198, 1287123, 1616, 1698966...
## $ totalTestResultsIncrease    <dbl> 0, 6095, 12517, 0, 12080, 125886, 25756...
## $ posNeg                      <dbl> 553120, 1322198, 1287123, 1616, 1698966...
## $ deathIncrease               <dbl> 0, 7, 15, 0, 4, 49, 6, 0, 0, 2, 76, 42,...
## $ hospitalizedIncrease        <dbl> 0, 0, 29, 0, 76, 0, 79, 0, 0, 0, 174, 9...
## $ hash                        <chr> "280ee400bd797c20b77218c9e54a0b6615f91a...
## $ commercialScore             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ negativeRegularScore        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ negativeScore               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ positiveScore               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ score                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ grade                       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## 
## 
## Control totals - note that validState other than TRUE will be discarded
## 
## # A tibble: 2 x 6
##   validState   cases deaths  hosp     tests     n
##   <lgl>        <dbl>  <dbl> <dbl>     <dbl> <dbl>
## 1 FALSE        66880    888    NA    471313  1115
## 2 TRUE       8464908 215758    NA 130976369 12042
## Rows: 12,042
## Columns: 6
## $ date   <date> 2020-10-24, 2020-10-24, 2020-10-24, 2020-10-24, 2020-10-24,...
## $ state  <chr> "AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL", ...
## $ cases  <dbl> 374, 2360, 1183, 890, 5945, 1350, 0, 97, 160, 4471, 1846, 14...
## $ deaths <dbl> 0, 7, 15, 4, 49, 6, 0, 0, 2, 76, 42, 3, 11, 9, 63, 26, 0, 8,...
## $ hosp   <dbl> 58, 920, 606, 819, 3007, 550, 233, 93, 103, 2162, 1684, 71, ...
## $ tests  <dbl> 0, 6095, 12517, 12080, 125886, 25756, 0, 5800, 2164, 72309, ...
## Rows: 12,042
## Columns: 14
## $ date   <date> 2020-01-22, 2020-01-22, 2020-01-23, 2020-01-23, 2020-01-24,...
## $ state  <chr> "MA", "WA", "MA", "WA", "MA", "WA", "MA", "WA", "MA", "WA", ...
## $ cases  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ hosp   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ tests  <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
## $ cpm    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ dpm    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ hpm    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ tpm    <dbl> 0.0000000, 0.0000000, 0.1471796, 0.0000000, 0.0000000, 0.000...
## $ cpm7   <dbl> NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ dpm7   <dbl> NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ hpm7   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ tpm7   <dbl> NA, NA, NA, NA, NA, NA, 0.04205130, 0.00000000, 0.06307695, ...
## `summarise()` regrouping output by 'state' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)

## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` regrouping output by 'date', 'cluster' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)

## 
## Recency is defined as 2020-09-25 through current
## 
## Recency is defined as 2020-09-25 through current

## `summarise()` regrouping output by 'state', 'cluster', 'date' (override with `.groups` argument)

## `summarise()` ungrouping output (override with `.groups` argument)

## `summarise()` ungrouping output (override with `.groups` argument)

## `summarise()` ungrouping output (override with `.groups` argument)
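
The combination of a cut into a fixed number of clusters followed by a manual reassignment can be sketched with base R. The burden values below are invented, and the loop is only a guess at what kCut= and reAssignState= do inside readRunCOVIDTrackingProject(); the real logic is in the sourced files.

```r
# Hypothetical state-level features (made-up numbers, not CTP data)
burden <- matrix(
  c(10.0, 9.0,
     9.5, 9.0,
     3.0, 1.0,
     3.2, 1.1,
     6.0, 3.0,
     6.2, 3.1,
     0.1, 0.1),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("NY", "NJ", "NH", "ME", "TX", "FL", "VT"),
                  c("deaths", "cases"))
)

# Hierarchical clustering on Euclidean distance, cut into 4 clusters
hc <- hclust(dist(burden), method = "complete")
clusters <- cutree(hc, k = 4)

# Move the outlier into another state's cluster (assumed analogue of reAssignState)
reAssignState <- list("VT" = "NH")
for (st in names(reAssignState)) {
  clusters[st] <- clusters[reAssignState[[st]]]
}

clusters
```

In this toy data VT starts as a singleton cluster, so cutting at 4 and reassigning it leaves 3 distinct segments, mirroring how a cut at 6 with one reassignment can yield 5.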

Use Case 2: Download new data, apply existing segments, assess performance

A modified process gathers new data and assesses existing state-level clusters:

# Use existing segments with updated data
locDownload <- "./RInputFiles/Coronavirus/CV_downloaded_201025.csv"
test_old_201025 <- readRunCOVIDTrackingProject(thruLabel="Oct 24, 2020", 
                                               downloadTo=if (file.exists(locDownload)) NULL else locDownload,
                                               readFrom=locDownload, 
                                               compareFile=readFromRDS("test_hier5_201001")$dfRaw,
                                               useClusters=readFromRDS("test_hier5_201001")$useClusters
                                               )
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   state = col_character(),
##   totalTestResultsSource = col_character(),
##   dataQualityGrade = col_character(),
##   lastUpdateEt = col_character(),
##   dateModified = col_datetime(format = ""),
##   checkTimeEt = col_character(),
##   dateChecked = col_datetime(format = ""),
##   fips = col_character(),
##   hash = col_character(),
##   grade = col_logical()
## )
## i Use `spec()` for the full column specifications.
## 
## File is unique by state and date
## 
## 
## Overall control totals in file:
## # A tibble: 1 x 3
##   positiveIncrease deathIncrease hospitalizedCurrently
##              <dbl>         <dbl>                 <dbl>
## 1          8531788        216646               8686442
## 
## *** COMPARISONS TO REFERENCE FILE: compareFile
## 
## Checkin for similarity of: column names
## In reference but not in current: 
## In current but not in reference: probableCases
## 
## Checkin for similarity of: states
## In reference but not in current: 
## In current but not in reference: 
## 
## Checkin for similarity of: dates
## In reference but not in current: 
## In current but not in reference: 2020-10-24 2020-10-23 2020-10-22 2020-10-21 2020-10-20 2020-10-19 2020-10-18 2020-10-17 2020-10-16 2020-10-15 2020-10-14 2020-10-13 2020-10-12 2020-10-11 2020-10-10 2020-10-09 2020-10-08 2020-10-07 2020-10-06 2020-10-05 2020-10-04 2020-10-03 2020-10-02 2020-10-01
## 
## *** Difference of at least 5 and difference is at least 1%:
## Joining, by = c("date", "name")
##         date             name newValue oldValue
## 1 2020-03-28 positiveIncrease    19925    19692
## 2 2020-03-28    deathIncrease      544      538
## 3 2020-03-29 positiveIncrease    19348    19581
## 4 2020-03-29    deathIncrease      515      521
## Joining, by = c("date", "name")
## Warning: Removed 24 row(s) containing missing values (geom_path).
## 
## 
## *** Difference of at least 5 and difference is at least 1%:
## Joining, by = c("state", "name")
##   state             name newValue oldValue
## 1    HI positiveIncrease    12469    12289
## Rows: 13,157
## Columns: 55
## $ date                        <date> 2020-10-24, 2020-10-24, 2020-10-24, 20...
## $ state                       <chr> "AK", "AL", "AR", "AS", "AZ", "CA", "CO...
## $ positive                    <dbl> 13535, 183276, 105318, 0, 236772, 89281...
## $ probableCases               <dbl> NA, 26330, 7105, NA, 5417, NA, 6492, 26...
## $ negative                    <dbl> 539585, 1138922, 1181805, 1616, 1462194...
## $ pending                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ totalTestResultsSource      <chr> "totalTestsViral", "totalTestsViral", "...
## $ totalTestResults            <dbl> 552746, 1295868, 1280018, 1616, 1693549...
## $ hospitalizedCurrently       <dbl> 58, 920, 606, NA, 819, 3007, 550, 233, ...
## $ hospitalizedCumulative      <dbl> NA, 19595, 6707, NA, 21043, NA, 8557, 1...
## $ inIcuCurrently              <dbl> NA, NA, 242, NA, 191, 744, NA, NA, 21, ...
## $ inIcuCumulative             <dbl> NA, 2021, NA, NA, NA, NA, NA, NA, NA, N...
## $ onVentilatorCurrently       <dbl> 8, NA, 94, NA, 87, NA, NA, NA, 6, NA, N...
## $ onVentilatorCumulative      <dbl> NA, 1157, 808, NA, NA, NA, NA, NA, NA, ...
## $ recovered                   <dbl> 6939, 74439, 93977, NA, 39525, NA, 7463...
## $ dataQualityGrade            <chr> "A", "A", "A+", "D", "A+", "B", "A", "B...
## $ lastUpdateEt                <chr> "10/24/2020 03:59", "10/24/2020 11:00",...
## $ dateModified                <dttm> 2020-10-24 03:59:00, 2020-10-24 11:00:...
## $ checkTimeEt                 <chr> "10/23 23:59", "10/24 07:00", "10/23 20...
## $ death                       <dbl> 68, 2866, 1797, 0, 5869, 17311, 2076, 4...
## $ hospitalized                <dbl> NA, 19595, 6707, NA, 21043, NA, 8557, 1...
## $ dateChecked                 <dttm> 2020-10-24 03:59:00, 2020-10-24 11:00:...
## $ totalTestsViral             <dbl> 552746, 1295868, 1280018, 1616, NA, 176...
## $ positiveTestsViral          <dbl> 11644, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ negativeTestsViral          <dbl> 540786, NA, 1181805, NA, NA, NA, NA, NA...
## $ positiveCasesViral          <dbl> 13535, 156946, 98213, 0, 231355, 892810...
## $ deathConfirmed              <dbl> 68, 2680, 1640, NA, 5581, NA, NA, 3674,...
## $ deathProbable               <dbl> NA, 186, 157, NA, 288, NA, NA, 903, NA,...
## $ totalTestEncountersViral    <dbl> NA, NA, NA, NA, NA, NA, 1790404, NA, 49...
## $ totalTestsPeopleViral       <dbl> NA, NA, NA, NA, 1693549, NA, 1124409, N...
## $ totalTestsAntibody          <dbl> NA, NA, NA, NA, 312232, NA, 179232, NA,...
## $ positiveTestsAntibody       <dbl> NA, NA, NA, NA, NA, NA, 12741, NA, NA, ...
## $ negativeTestsAntibody       <dbl> NA, NA, NA, NA, NA, NA, 166491, NA, NA,...
## $ totalTestsPeopleAntibody    <dbl> NA, 63359, NA, NA, NA, NA, NA, NA, NA, ...
## $ positiveTestsPeopleAntibody <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ negativeTestsPeopleAntibody <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ totalTestsPeopleAntigen     <dbl> NA, NA, 46505, NA, NA, NA, NA, NA, NA, ...
## $ positiveTestsPeopleAntigen  <dbl> NA, NA, 7891, NA, NA, NA, NA, NA, NA, N...
## $ totalTestsAntigen           <dbl> NA, NA, 21856, NA, NA, NA, NA, NA, NA, ...
## $ positiveTestsAntigen        <dbl> NA, NA, 3300, NA, NA, NA, NA, NA, NA, N...
## $ fips                        <chr> "02", "01", "05", "60", "04", "06", "08...
## $ positiveIncrease            <dbl> 374, 2360, 1183, 0, 890, 5945, 1350, 0,...
## $ negativeIncrease            <dbl> 0, 5064, 11643, 0, 11213, 119941, 9351,...
## $ total                       <dbl> 553120, 1322198, 1287123, 1616, 1698966...
## $ totalTestResultsIncrease    <dbl> 0, 6095, 12517, 0, 12080, 125886, 25756...
## $ posNeg                      <dbl> 553120, 1322198, 1287123, 1616, 1698966...
## $ deathIncrease               <dbl> 0, 7, 15, 0, 4, 49, 6, 0, 0, 2, 76, 42,...
## $ hospitalizedIncrease        <dbl> 0, 0, 29, 0, 76, 0, 79, 0, 0, 0, 174, 9...
## $ hash                        <chr> "280ee400bd797c20b77218c9e54a0b6615f91a...
## $ commercialScore             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ negativeRegularScore        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ negativeScore               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ positiveScore               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ score                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ grade                       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## 
## 
## Control totals - note that validState other than TRUE will be discarded
## 
## # A tibble: 2 x 6
##   validState   cases deaths  hosp     tests     n
##   <lgl>        <dbl>  <dbl> <dbl>     <dbl> <dbl>
## 1 FALSE        66880    888    NA    471313  1115
## 2 TRUE       8464908 215758    NA 130976369 12042
## Rows: 12,042
## Columns: 6
## $ date   <date> 2020-10-24, 2020-10-24, 2020-10-24, 2020-10-24, 2020-10-24,...
## $ state  <chr> "AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL", ...
## $ cases  <dbl> 374, 2360, 1183, 890, 5945, 1350, 0, 97, 160, 4471, 1846, 14...
## $ deaths <dbl> 0, 7, 15, 4, 49, 6, 0, 0, 2, 76, 42, 3, 11, 9, 63, 26, 0, 8,...
## $ hosp   <dbl> 58, 920, 606, 819, 3007, 550, 233, 93, 103, 2162, 1684, 71, ...
## $ tests  <dbl> 0, 6095, 12517, 12080, 125886, 25756, 0, 5800, 2164, 72309, ...
## Rows: 12,042
## Columns: 14
## $ date   <date> 2020-01-22, 2020-01-22, 2020-01-23, 2020-01-23, 2020-01-24,...
## $ state  <chr> "MA", "WA", "MA", "WA", "MA", "WA", "MA", "WA", "MA", "WA", ...
## $ cases  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ hosp   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ tests  <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
## $ cpm    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ dpm    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ hpm    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ tpm    <dbl> 0.0000000, 0.0000000, 0.1471796, 0.0000000, 0.0000000, 0.000...
## $ cpm7   <dbl> NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ dpm7   <dbl> NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ hpm7   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ tpm7   <dbl> NA, NA, NA, NA, NA, NA, 0.04205130, 0.00000000, 0.06307695, ...
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` regrouping output by 'date', 'cluster' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)

## 
## Recency is defined as 2020-09-25 through current
## 
## Recency is defined as 2020-09-25 through current

## `summarise()` regrouping output by 'state', 'cluster', 'date' (override with `.groups` argument)

## `summarise()` ungrouping output (override with `.groups` argument)

## `summarise()` ungrouping output (override with `.groups` argument)

## `summarise()` ungrouping output (override with `.groups` argument)
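
The comparison output above flags values where the "difference is at least 5 and difference is at least 1%". A minimal sketch of such a rule follows, using the four flagged rows from the output plus one hypothetical unflagged row. The choice of denominator (the mean of old and new values) is an assumption, since the function's actual formula is not shown in this file.

```r
# First four rows are copied from the comparison output; the fifth is hypothetical
cmp <- data.frame(
  date = as.Date(c("2020-03-28", "2020-03-28", "2020-03-29",
                   "2020-03-29", "2020-03-30")),
  name = c("positiveIncrease", "deathIncrease", "positiveIncrease",
           "deathIncrease", "deathIncrease"),
  newValue = c(19925, 544, 19348, 515, 1003),
  oldValue = c(19692, 538, 19581, 521, 1000)
)

# Absolute and relative differences (relative to the mean of the two values)
absDiff <- abs(cmp$newValue - cmp$oldValue)
pctDiff <- absDiff / ((cmp$newValue + cmp$oldValue) / 2)

# Keep rows that clear both thresholds
flagged <- cmp[absDiff >= 5 & pctDiff >= 0.01, ]
flagged
```

Requiring both thresholds keeps small-count noise (a change of 2 on a base of 10) and trivially small relative changes on large totals out of the report.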

Use Case 3: Use existing data and explore clustering approach

A different clustering approach can be assessed using existing data. A common example is exploring k-means clustering with the previously processed state-level data:

# Test function for k-means clustering using the per capita data file previously created
test_km5_201025 <- readRunCOVIDTrackingProject(thruLabel="Oct 24, 2020", 
                                               dfPerCapita=test_hier5_201025$dfPerCapita,
                                               hierarchical=FALSE,
                                               minShape=3, 
                                               ratioDeathvsCase = 5, 
                                               ratioTotalvsShape = 0.5, 
                                               minDeath=100, 
                                               minCase=10000, 
                                               nCenters=5,
                                               testCenters=1:10, 
                                               iter.max=20,
                                               nstart=10, 
                                               seed=2008261400
                                               )
## `summarise()` regrouping output by 'state' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
## 
## Cluster means and counts
##                1    2    3     4    5
## .           9.00 8.00 5.00 20.00 9.00
## totalCases  1.17 0.78 0.86  0.91 0.47
## totalDeaths 3.34 3.50 6.32  1.41 1.49
## cases_3     0.01 0.03 0.07  0.01 0.03
## deaths_3    0.04 0.09 0.15  0.05 0.23
## cases_4     0.04 0.18 0.35  0.04 0.09
## deaths_4    0.38 1.50 2.27  0.49 1.22
## cases_5     0.05 0.18 0.17  0.05 0.11
## deaths_5    0.46 1.62 1.37  0.50 1.15
## cases_6     0.13 0.07 0.07  0.07 0.09
## deaths_6    0.40 0.63 0.41  0.38 0.67
## cases_7     0.31 0.12 0.10  0.17 0.14
## deaths_7    1.03 0.33 0.26  0.58 0.49
## cases_8     0.20 0.13 0.08  0.18 0.11
## deaths_8    1.26 0.25 0.25  0.83 0.45
## cases_9     0.14 0.13 0.06  0.21 0.11
## deaths_9    0.89 0.27 0.16  0.99 0.41
## cases_10    0.13 0.17 0.10  0.28 0.17
## deaths_10   0.55 0.31 0.13  1.16 0.34
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` regrouping output by 'date', 'cluster' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)

## 
## Recency is defined as 2020-09-25 through current
## 
## Recency is defined as 2020-09-25 through current

## `summarise()` regrouping output by 'state', 'cluster', 'date' (override with `.groups` argument)

## `summarise()` ungrouping output (override with `.groups` argument)

## `summarise()` ungrouping output (override with `.groups` argument)

## `summarise()` ungrouping output (override with `.groups` argument)

The silhouette plot suggests that k-means may not be an ideal approach, or at least that there is no obviously optimal number of segments.
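
A rough sketch of how a range of cluster counts might be compared is shown below. It runs on random toy data rather than the state-level file, reuses the iter.max, nstart, and seed values from the call above, and assumes (without confirmation from the sourced code) that testCenters= drives a loop of this kind. The silhouette step runs only if the cluster package is installed.

```r
# Toy feature matrix: 30 hypothetical observations, 4 features
set.seed(2008261400)
x <- matrix(rnorm(30 * 4), nrow = 30)
testCenters <- 1:10

# Total within-cluster sum of squares for each candidate number of centers
totWithin <- sapply(testCenters, function(k) {
  kmeans(x, centers = k, iter.max = 20, nstart = 10)$tot.withinss
})

# Average silhouette width (defined only for 2 or more clusters)
if (requireNamespace("cluster", quietly = TRUE)) {
  d <- dist(x)
  avgSil <- sapply(testCenters[testCenters > 1], function(k) {
    km <- kmeans(x, centers = k, iter.max = 20, nstart = 10)
    mean(cluster::silhouette(km$cluster, d)[, "sil_width"])
  })
}
```

A clear peak in average silhouette width, or a sharp elbow in the within-cluster sum of squares, would point to a preferred number of segments; a flat profile is the kind of result that leaves the choice ambiguous.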

Use Case 4: Create list with burden and cluster data for a downstream purpose

combine_201025 <- readRunCOVIDTrackingProject(thruLabel="Oct 24, 2020", 
                                              dfPerCapita=test_hier5_201025$dfPerCapita,
                                              useClusters=readFromRDS("test_hier5_201001")$useClusters, 
                                              skipAssessmentPlots=TRUE
                                              )
str(combine_201025)
## List of 8
##  $ stateData           : tibble [51 x 3] (S3: tbl_df/tbl/data.frame)
##   ..$ state: chr [1:51] "AL" "AK" "AZ" "AR" ...
##   ..$ name : chr [1:51] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##   ..$ pop  : num [1:51] 4858979 738432 6828065 2978204 39144818 ...
##  $ dfRaw               : NULL
##  $ dfFiltered          : NULL
##  $ dfPerCapita         : tibble [12,042 x 14] (S3: tbl_df/tbl/data.frame)
##   ..$ date  : Date[1:12042], format: "2020-01-22" "2020-01-22" ...
##   ..$ state : chr [1:12042] "MA" "WA" "MA" "WA" ...
##   ..$ cases : num [1:12042] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ deaths: num [1:12042] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ hosp  : num [1:12042] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ tests : num [1:12042] 0 0 1 0 0 0 0 0 0 0 ...
##   ..$ cpm   : num [1:12042] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ dpm   : num [1:12042] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ hpm   : num [1:12042] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ tpm   : num [1:12042] 0 0 0.147 0 0 ...
##   ..$ cpm7  : num [1:12042] NA NA NA NA NA NA 0 0 0 0 ...
##   ..$ dpm7  : num [1:12042] NA NA NA NA NA NA 0 0 0 0 ...
##   ..$ hpm7  : num [1:12042] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ tpm7  : num [1:12042] NA NA NA NA NA ...
##  $ useClusters         : Named int [1:51] 1 2 1 2 1 3 4 5 5 2 ...
##   ..- attr(*, "names")= chr [1:51] "AK" "AL" "AR" "AZ" ...
##  $ plotData            : NULL
##  $ consolidatedPlotData: NULL
##  $ clCum               : NULL

The list is properly formatted (though lacking the plotting and cumulative components) such that it could be used by other functions that rely on the data being available in this format.
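
As one illustration of downstream use, the named cluster vector can be attached to the per-capita data by indexing on state. The sketch below uses small stand-in objects (the values are hypothetical, apart from the first three cluster assignments, which mirror the str() output above) rather than the real list.

```r
# Stand-ins for combine_201025$useClusters and combine_201025$dfPerCapita
useClusters <- c("AK" = 1L, "AL" = 2L, "AR" = 1L)
dfPerCapita <- data.frame(state = c("AL", "AK", "AR", "AL"),
                          cpm7 = c(10, 20, 40, 30))

# Look up each row's cluster by state abbreviation
dfPerCapita$cluster <- unname(useClusters[dfPerCapita$state])

# Simple per-cluster summary of the rolling metric
clusterMeans <- tapply(dfPerCapita$cpm7, dfPerCapita$cluster, mean)
clusterMeans
```

Because the lookup is by name, the order of states in useClusters does not need to match the order of rows in the data.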

Saving key files as RDS

saveToRDS(test_hier5_201025, ovrWriteError=FALSE)
saveToRDS(test_old_201025, ovrWriteError=FALSE)
saveToRDS(test_km5_201025, ovrWriteError=FALSE)
saveToRDS(combine_201025, ovrWriteError=FALSE)
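
saveToRDS() and readFromRDS() are defined in the sourced files and are not shown here. For readers without those files, a comparable round trip can be sketched with base R's saveRDS() and readRDS(). The helper saveIfAbsent() below is hypothetical and only approximates an overwrite guard; it is not the author's implementation of ovrWriteError=.

```r
# Hypothetical helper: refuse to overwrite an existing file unless explicitly allowed
saveIfAbsent <- function(obj, path, overwrite = FALSE) {
  if (file.exists(path) && !overwrite) {
    stop("File already exists: ", path)
  }
  saveRDS(obj, file = path)
  invisible(path)
}

# Small stand-in for one of the result lists
exampleList <- list(useClusters = c("AK" = 1L, "AL" = 2L), dfRaw = NULL)
rdsPath <- tempfile(fileext = ".RDS")

# First save succeeds; reading it back restores the same object
saveIfAbsent(exampleList, rdsPath)
restored <- readRDS(rdsPath)

# A second save to the same path is blocked by the guard
blocked <- inherits(try(saveIfAbsent(exampleList, rdsPath), silent = TRUE),
                    "try-error")
```

RDS files store a single R object with its attributes intact, so named vectors and nested lists come back unchanged.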